{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4 - Evaluating models and intro to Pipelines\n",
    "\n",
    "In this chapter, you will:\n",
    "\n",
    "• Learn to evaluate your ML model\n",
    "\n",
    "• Learn to use Pipelines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession \n",
    "from pyspark.ml import Pipeline\n",
    "from pyspark.ml.feature import VectorIndexer\n",
    "from pyspark.ml.regression import LinearRegression, LinearRegressionModel\n",
    "from pyspark.ml.fpm import FPGrowth, FPGrowthModel\n",
    "\n",
    "spark = SparkSession.builder \\\n",
    "    .master('local[*]') \\\n",
    "    .appName(\"Pipelines\") \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load machine learning model :\n",
    "\n",
    "\n",
    "Load the models from previous Chapter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr_model = LinearRegressionModel.load('linearRegression_model')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fpgrowth_model = FPGrowthModel.load('fpGrowth_model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluate the models:\n",
    "\n",
    "For evaluation, load classified test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test = spark.read.parquet(\"classified_test_data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[From the docs:](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)\n",
    "\n",
    "While there are many different types of classification algorithms, the evaluation of classification models all shares similar principles. \n",
    "\n",
    "In a supervised classification problem, there exists a _true output_ and a _model-generated predicted output_ for each data row. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1: Evaluate your ML model "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RegressionEvaluator functionality:\n",
    "\n",
    "\n",
    "✅ **Task :** \n",
    "\n",
    "\n",
    "Start with predicting the outcome:\n",
    "Use predict function\n",
    "\n",
    "```python\n",
    "    model.transform(vectorOfFeatures).select('prediction').show()\n",
    "```\n",
    "\n",
    "Notice that transform takes a vector of features as input.\n",
    "\n",
    "Prediction represents if it's a bot or not.\n",
    "1- bot\n",
    "0- human"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import VectorAssembler\n",
    "\n",
    "test = df_test.drop('description')\n",
    "vecAssembler = VectorAssembler(inputCols=['screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name'], outputCol=\"features\", handleInvalid = \"skip\")\n",
    "vecAssembler = vecAssembler.transform(test)\n",
    "\n",
    "model_test_prediction = lr_model.transform(vecAssembler)\n",
    "model_test_prediction.select('bot','prediction').show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model gave us a prediction of the chances for a specific row to\n",
    "be a bot. We got numbers like 0.147 and 0.1021.\n",
    "\n",
    "It is up to us to define the **threshold** for classifying a bot.\n",
    "If it shows us 0.9? Will it satisfy us? How certain do we want to be in the classification?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**RegressionEvaluator** is the evaluator for regression-based models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use regressionEvaluator to evaluate the model.\n",
    "\n",
    "\n",
    "Code Sample\n",
    "```pyhon\n",
    "    from pyspark.ml.evaluation import RegressionEvaluator\n",
    "    lr_evaluator = RegressionEvaluator(predictionCol=\"prediction\", labelCol=\"bot\",metricName=\"r2\")\n",
    "    R2 = lr_evaluator.evaluate(model_test_prediction)\n",
    "```\n",
    "\n",
    "Check out R-Square :\n",
    "\n",
    "\n",
    ">R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 100% indicates that the model explains all the variability of the response data around its mean\n",
    "\n",
    "From: [RegressionAnalysis](https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "scrolled": true
   },
   "source": [
    "Notice `metricName` param:\n",
    "\n",
    "RegressionEvaluator Supports: - `rmse` (default): root mean squared error - `mse`: mean squared error - `r2`: R Sqaure metric - `mae`: mean absolute error\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**Notice!** here we work with the train data and select both `bot` and `prediction` to get a feel for the classifier\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import VectorAssembler\n",
    "\n",
    "df_train = spark.read.parquet(\"classified_train_data\")\n",
    "\n",
    "train = df_train.drop('description')\n",
    "vecAssemblerTrain = VectorAssembler(inputCols=['screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name'], outputCol=\"features\", handleInvalid = \"skip\")\n",
    "vecAssemblerTrain = vecAssemblerTrain.transform(train)\n",
    "\n",
    "model_train_prediction = lr_model.transform(vecAssemblerTrain)\n",
    "model_train_prediction.select('bot','prediction').show()\n",
    "\n",
    "lr_evaluator = RegressionEvaluator(predictionCol=\"prediction\", labelCol=\"bot\",metricName=\"r2\")\n",
    "R2 = lr_evaluator.evaluate(model_train_prediction)\n",
    "\n",
    "print(\"R Squared (R2) on test data = %g\" % R2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When looking back at the `Predictions` output, we understand that they don't help us much. \n",
    "\n",
    "`Predictions` output is a number between [0,1].\n",
    "\n",
    "\n",
    "However, we expect 1 or 0: bot or human.\n",
    "What can we do?\n",
    "Decide on a threshold.\n",
    "\n",
    "\n",
    "For Example, every prediction above 0.8 is bot. Bellow 0.8 is human.\n",
    "\n",
    "Or maybe every prediction above 0.14?"
   ]
  },
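  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feel for how the choice of threshold changes the outcome, here is a minimal sketch in plain Python. The `rows` values and the `classify` and `accuracy` helpers are hypothetical and only illustrate the idea; they are not taken from the dataset or from Spark.\n",
    "\n",
    "```python\n",
    "# Hypothetical (label, prediction) pairs, made up for illustration.\n",
    "rows = [(1, 0.91), (0, 0.15), (1, 0.42), (0, 0.10), (0, 0.55), (1, 0.83)]\n",
    "\n",
    "\n",
    "def classify(prediction, threshold):\n",
    "    # Predictions above the threshold count as bot (1), the rest as human (0).\n",
    "    return 1 if prediction > threshold else 0\n",
    "\n",
    "\n",
    "def accuracy(rows, threshold):\n",
    "    # Share of rows whose thresholded prediction matches the label.\n",
    "    hits = sum(1 for label, pred in rows if classify(pred, threshold) == label)\n",
    "    return hits / len(rows)\n",
    "\n",
    "\n",
    "# Compare a few candidate thresholds on the same rows.\n",
    "for threshold in (0.14, 0.5, 0.8):\n",
    "    print('threshold=%.2f accuracy=%.2f' % (threshold, accuracy(rows, threshold)))\n",
    "```\n",
    "\n",
    "On real data you would apply the same comparison to the `prediction` column and weigh accuracy against how costly a wrong bot classification is."
   ]
  },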
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "✅ Task :\n",
    "\n",
    "Use model statistics params:\n",
    "\n",
    "For example\n",
    "Check RMSE - Root Mean Squared Error\n",
    "For both train and test.\n",
    "\n",
    "Code sample:\n",
    "\n",
    "```python\n",
    "def getLRSummary(df):\n",
    "    df = df.drop('description')\n",
    "    vecAssembler = VectorAssembler(inputCols=['screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name'], outputCol=\"features\", handleInvalid = \"skip\")\n",
    "    vecAssembler = vecAssembler.transform(df)\n",
    "    output = vecAssembler.drop('screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name')\n",
    "    output  = output.selectExpr(\"features\", \"bot as label\")\n",
    "    # evaluate function returns LinearRegressionSummary instance that holds the evaluate results\n",
    "    return lr_model.evaluate(output)\n",
    "\n",
    "train_results = getLRSummary(df_train)\n",
    "print(\"Root Mean Squared Error (RMSE) on train data = %g\" % train_results.rootMeanSquaredError)\n",
    "\n",
    "```\n",
    "\n",
    "Here are function [r2 docs](https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegressionSummary.r2)\n",
    "\n",
    "Check on both training and test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getLRSummary(df):\n",
    "    df = df.drop('description')\n",
    "    vecAssembler = VectorAssembler(inputCols=['screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name'], outputCol=\"features\", handleInvalid = \"skip\")\n",
    "    vecAssembler = vecAssembler.transform(df)\n",
    "    output = vecAssembler.drop('screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name')\n",
    "    output  = output.selectExpr(\"features\", \"bot as label\")\n",
    "    # evaluate function returns LinearRegressionSummary instance that holds the evaluate results\n",
    "    return lr_model.evaluate(output)\n",
    "\n",
    "\n",
    "# get summary for train set:\n",
    "train_results = # your code goes here\n",
    "print(\"Root Mean Squared Error (RMSE) on train data = %g\" % train_results.rootMeanSquaredError)\n",
    "\n",
    "# get summary for test set:\n",
    "test_results = # your code goes here\n",
    "print(\"Root Mean Squared Error (RMSE) on test data = %g\" % test_results.rootMeanSquaredError)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " `LinearRegressionSummary` gives you a summary of the statistical algorithm evaluations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\"r2 on test data = %g\" % test_results.r2)\n",
    "print(\"r2 on train data = %g\" % train_results.r2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What did you learn?\n",
    "What more evaluating params can you get out of LinearRegressionSummary instance?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "**Reminder**\n",
    "What is r2? R Square: \n",
    ">R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.\n",
    "\n",
    "R Square measure how much of the variability in `bot` / `label` can be explained using the model.\n",
    "We must be cautious that the performance on the training set to avoid overfitting of the model to the training set.\n",
    "Overrfiting can create a model that is good only for the training set and not for the test set.\n",
    "\n",
    "\n",
    "What is RMSE?\n",
    "> Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.\n",
    "\n"
   ]
  },
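  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a compact reference, the two metrics are usually written as follows. The notation is introduced here for illustration only: $y_i$ is the true label of row $i$, $\\hat{y}_i$ is the model's prediction, $\\bar{y}$ is the mean of the labels, and $n$ is the number of rows.\n",
    "\n",
    "$$\\mathrm{RMSE} = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2}, \\qquad R^2 = 1 - \\frac{\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2}{\\sum_{i=1}^{n}\\left(y_i - \\bar{y}\\right)^2}$$\n",
    "\n",
    "An RMSE close to 0 and an $R^2$ close to 1 both indicate predictions that sit close to the labels."
   ]
  },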
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2: Build Simple Spark ML Pipelines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help us create and tune practical machine learning pipelines."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous exercise, you learned Logistic regression.\n",
    "\n",
    "Logistic regression is used when the dependent variable is binary.\n",
    "In our case, bot is binary - yes or no.\n",
    "\n",
    "Linear regression is used to predict the continuous dependent variable. \n",
    "This explains the result received.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Spark ML Pipelines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start with a simple ML Pipelines:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml import Pipeline\n",
    "from pyspark.ml.classification import LogisticRegression\n",
    "from pyspark.ml.feature import HashingTF, Tokenizer\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that in Chapter 1 we split `description` into a list?\n",
    "Let's do it with the `Tokenizer` functionality instead!\n",
    "\n",
    "`Tokenizer` is part of the `pyspark.ml.feature`. `pyspark.ml.feature` give us many out of the box functionality for feature extraction. Feature extraction is the _data-science_ way of transforming columns into a new one."
   ]
  },
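  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what this kind of feature extraction does to a single `description`, here is a rough sketch in plain Python. It mimics the idea behind `Tokenizer` (lowercase, then split on whitespace) and `HashingTF` (count words in a fixed number of hashed buckets). The helper names `tokenize` and `hashing_tf`, the sample text, and the use of `zlib.crc32` as the hash are assumptions made for illustration; Spark uses its own hash function and a much larger default number of features, so the actual vectors will differ.\n",
    "\n",
    "```python\n",
    "import zlib\n",
    "\n",
    "\n",
    "def tokenize(text):\n",
    "    # Lowercase the text and split it on whitespace.\n",
    "    return text.lower().split()\n",
    "\n",
    "\n",
    "def hashing_tf(words, num_features=16):\n",
    "    # Map each word to a bucket index and count occurrences per bucket.\n",
    "    # zlib.crc32 is a stand-in hash for this sketch only.\n",
    "    vector = [0] * num_features\n",
    "    for word in words:\n",
    "        index = zlib.crc32(word.encode('utf-8')) % num_features\n",
    "        vector[index] += 1\n",
    "    return vector\n",
    "\n",
    "\n",
    "words = tokenize('I am a friendly Twitter bot')  # hypothetical description\n",
    "features = hashing_tf(words)\n",
    "print(words)\n",
    "print(features)\n",
    "```\n",
    "\n",
    "Two different words can land in the same bucket, which is why the number of features is a parameter worth tuning."
   ]
  },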
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load saved data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = spark.read.parquet('train_data_only_description')\n",
    "(trainingData, testData) = data.randomSplit([0.7, 0.3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenizer, HashingTF, and Logistic Regression\n",
    "\n",
    "✅ **Task :** \n",
    "\n",
    "Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr:\n",
    "\n",
    "\n",
    "Use the next code sample and adjust it to your needs :\n",
    "```python\n",
    "tokenizer = Tokenizer(inputCol=\"description\", outputCol=\"words\")\n",
    "hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol=\"features\")\n",
    "lr = LogisticRegression(maxIter=10, regParam=0.001)\n",
    "pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])\n",
    "\n",
    "```\n",
    "\n",
    "\n",
    "After we understand that Linear Regression might not be good enough for our data science purposes, we are going to work with Logistic Regression. This is your **3rd** Machine Learning model with Spark ML 🎉"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code goes here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Call fit on Pipeline to get the model:\n",
    "\n",
    "If it fails here, validates that `description` doesn't have null values.\n",
    "Our HashingTF doesn't know how to handle null values.\n",
    "If those exist, create a new DataFrame without them and use the new DataFrame to build the model.\n",
    "\n",
    "In case you need it:\n",
    "``` python\n",
    "    trainingData = trainingData.dropna('description')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit the pipeline to training documents.\n",
    "model = pipeline.fit(trainingData)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make predictions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make predictions on test documents and print columns of interest.\n",
    "prediction = model.transform(testData)\n",
    "selected = prediction.select(\"description\", \"probability\", \"prediction\")\n",
    "for row in selected.collect():\n",
    "    description, prob, prediction = row\n",
    "    print(\"(%s) --> prob=%s, prediction=%f\" % (description, str(prob), prediction))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the text output search for\n",
    "`prediction=1`\n",
    "\n",
    "And write in the chat which description got classified as bot!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3: Put everything together"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CrossValidator, BinaryClassificationEvaluator and ParamGridBuilder functionality\n",
    "\n",
    "✅ **Task :** \n",
    "\n",
    "\n",
    "CrossValidator provide us the ability to run multiple training set and testing set within one function call - \n",
    "`fit`.\n",
    "\n",
    "It runs the evaluation phase and chooses the best parameters.\n",
    "\n",
    "Read about `CrossValidator` in the [docs](https://spark.apache.org/docs/latest/ml-tuning.html) and integrate it into your pipeline.\n",
    "\n",
    "In the docs, search for `CrossValidator` `python` example.\n",
    "\n",
    "Copy the example to the notebook and adjust it to your needs.\n",
    "\n",
    "From the docs:\n",
    ">`CrossValidator` - K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping randomly partitioned folds which are used as separate training and test datasets e.g., with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold is used as the test set exactly once.\n",
    "\n",
    "<details><summary>Can't find the example, click here! </summary>\n",
    "<p>\n",
    "\n",
    "```python\n",
    "from pyspark.ml import Pipeline\n",
    "from pyspark.ml.classification import LogisticRegression\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "from pyspark.ml.feature import HashingTF, Tokenizer\n",
    "from pyspark.ml.tuning import CrossValidator, ParamGridBuilder\n",
    "\n",
    "# Prepare training documents, which are labeled.\n",
    "training = spark.createDataFrame([\n",
    "    (0, \"a b c d e spark\", 1.0),\n",
    "    (1, \"b d\", 0.0),\n",
    "    (2, \"spark f g h\", 1.0),\n",
    "    (3, \"hadoop mapreduce\", 0.0),\n",
    "    (4, \"b spark who\", 1.0),\n",
    "    (5, \"g d a y\", 0.0),\n",
    "    (6, \"spark fly\", 1.0),\n",
    "    (7, \"was mapreduce\", 0.0),\n",
    "    (8, \"e spark program\", 1.0),\n",
    "    (9, \"a e c l\", 0.0),\n",
    "    (10, \"spark compile\", 1.0),\n",
    "    (11, \"hadoop software\", 0.0)\n",
    "], [\"id\", \"text\", \"label\"])\n",
    "\n",
    "# Configure an ML pipeline, which consists of tree stages: tokenizer, hashingTF, and lr.\n",
    "tokenizer = Tokenizer(inputCol=\"text\", outputCol=\"words\")\n",
    "hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol=\"features\")\n",
    "lr = LogisticRegression(maxIter=10)\n",
    "pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])\n",
    "\n",
    "# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.\n",
    "# This will allow us to jointly choose parameters for all Pipeline stages.\n",
    "# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.\n",
    "# We use a ParamGridBuilder to construct a grid of parameters to search over.\n",
    "# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,\n",
    "# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.\n",
    "paramGrid = ParamGridBuilder() \\\n",
    "    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \\\n",
    "    .addGrid(lr.regParam, [0.1, 0.01]) \\\n",
    "    .build()\n",
    "\n",
    "crossval = CrossValidator(estimator=pipeline,\n",
    "                          estimatorParamMaps=paramGrid,\n",
    "                          evaluator=BinaryClassificationEvaluator(),\n",
    "                          numFolds=2)  # use 3+ folds in practice\n",
    "\n",
    "# Run cross-validation, and choose the best set of parameters.\n",
    "cvModel = crossval.fit(training)\n",
    "\n",
    "# Prepare test documents, which are unlabeled.\n",
    "test = spark.createDataFrame([\n",
    "    (4, \"spark i j k\"),\n",
    "    (5, \"l m n\"),\n",
    "    (6, \"mapreduce spark\"),\n",
    "    (7, \"apache hadoop\")\n",
    "], [\"id\", \"text\"])\n",
    "\n",
    "# Make predictions on test documents. cvModel uses the best model found (lrModel).\n",
    "prediction = cvModel.transform(test)\n",
    "selected = prediction.select(\"id\", \"text\", \"probability\", \"prediction\")\n",
    "for row in selected.collect():\n",
    "    print(row)\n",
    "```\n",
    "    \n",
    "</p>\n",
    "</details>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>Answer</summary>\n",
    "<p>\n",
    "\n",
    "```python\n",
    "from pyspark.ml import Pipeline\n",
    "from pyspark.ml.classification import LogisticRegression\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "from pyspark.ml.feature import HashingTF, Tokenizer\n",
    "from pyspark.ml.tuning import CrossValidator, ParamGridBuilder\n",
    "\n",
    "\n",
    "# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.\n",
    "# This will allow us to jointly choose parameters for all Pipeline stages.\n",
    "# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.\n",
    "# We use a ParamGridBuilder to construct a grid of parameters to search over.\n",
    "# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,\n",
    "# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.\n",
    "paramGrid = ParamGridBuilder() \\\n",
    "    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \\\n",
    "    .addGrid(lr.regParam, [0.1, 0.01]) \\\n",
    "    .build()\n",
    "\n",
    "\n",
    "crossval = CrossValidator(estimator=pipeline,\n",
    "                          estimatorParamMaps=paramGrid,\n",
    "                          evaluator=BinaryClassificationEvaluator(),\n",
    "                          numFolds=3)  # use 3+ folds in practice\n",
    "\n",
    "\n",
    "# Run cross-validation, and choose the best set of parameters.\n",
    "cvModel = crossval.fit(trainingData)\n",
    "\n",
    "prediction = cvModel.transform(testData)\n",
    "selected = prediction.select(\"description\", \"probability\", \"prediction\")\n",
    "for row in selected.collect():\n",
    "    print(row)\n",
    "   \n",
    "```\n",
    "    \n",
    "</p>\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the text output search for\n",
    "`prediction=1`\n",
    "\n",
    "Write in the chat which `description` got classified as a bot!\n",
    "\n",
    "Notice that in some of the `description` exists the word - `bot`.\n",
    "\n",
    "Meaning your algorithm found it without being told directly to search for the word bot 🤓"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the last task, you used `BinaryClassificationEvaluator` since it is more accurate to our needs. It works with Binary data - bot or human.\n",
    "`ParamGridBuilder` is a utility that helps us construct a parameter grid for our algorithm. It helps us test out various models built with various params. \n",
    "\n",
    "`ParamGridBuilder` is part of `pyspark.ml.tuning` lib. \n"
   ]
  },
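  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the size of the search concrete, here is a small sketch in plain Python that enumerates the same parameter combinations as the grid in the answer above and counts how many models get trained with 3 folds. The names `grid`, `param_maps`, and `total_fits` are illustrative and are not part of the Spark API.\n",
    "\n",
    "```python\n",
    "from itertools import product\n",
    "\n",
    "# Same candidate values as in the answer above.\n",
    "grid = {'numFeatures': [10, 100, 1000], 'regParam': [0.1, 0.01]}\n",
    "num_folds = 3\n",
    "\n",
    "# Every combination of parameter values is one candidate model.\n",
    "param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]\n",
    "\n",
    "# Each candidate is trained once per fold.\n",
    "total_fits = len(param_maps) * num_folds\n",
    "\n",
    "for params in param_maps:\n",
    "    print(params)\n",
    "print('candidates=%d, fits during cross-validation=%d' % (len(param_maps), total_fits))\n",
    "```\n",
    "\n",
    "The count grows quickly as you add values or folds, so keep the grid small while experimenting."
   ]
  },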
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Well Done! 👏👏👏\n",
    "## You just finished chapter 4: Spark ML - Evaluating models and intro to Pipelines\n",
    "## This was the last chapter for **Apache Spark ML First Steps**\n",
    "## I hope you enjoyed it!\n",
    "\n",
    "\n",
    "[@adipolak](https://twitter.com/AdiPolak)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
